Abstract

We study efficient deterministic executions of parallel algorithms on restartable fail-stop CRCW PRAMs. We allow
the PRAM processors to be subject to arbitrary stop failures and restarts that are determined by an on-line
adversary and that result in loss of private memory but do not affect shared memory. For this model, we define and
justify two complexity measures: completed work, where processors are charged for completed fixed-size update
cycles, and overhead ratio, which amortizes the work over necessary work and failures. This framework is a nontrivial
extension of the fail-stop no-restart model of [KS 89].

We present a simulation strategy for any $N$-processor PRAM on a restartable fail-stop $P$-processor CRCW
PRAM such that it guarantees a terminating execution of each simulated $N$-processor step, with $O(\log^2 N)$ overhead
ratio and $O(\min\{N + P \log^2 N + M \log N,\ N \cdot P^{0.6}\})$ (sub-quadratic) completed work, where $M$ is the number of
failures during this step's simulation. This strategy is work-optimal when the number of simulating processors is
$P \le N/\log^2 N$ and the total number of failures per each simulated $N$-processor step is $O(N/\log N)$. These results
are based on a new algorithm for the Write-All problem "$P$ processors write 1's in an array of size $N$", together
with a modification of the main algorithm of [KS 89] and with the techniques in [KPS 90, Shv 89].

We observe that, on $P = N$ restartable fail-stop processors, the Write-All problem requires $\Omega(N \log N)$
completed work, and this lower bound holds even under the additional assumption that processors can read and locally
process the entire shared memory at unit cost. Under this unrealistic assumption we have a matching upper bound.
The lower bound also applies to the expected completed work of randomized algorithms that are subject to on-line
adversaries. Finally, we describe a simple on-line adversary that causes inefficiency in many randomized algorithms.


1  Introduction 

Context  of  this  work: 

The model of parallel computation known as the Parallel Random Access Machine or PRAM [FW 78] has
attracted much attention in recent years. Many efficient and optimal algorithms have been designed for it;
see the surveys [EG 88, KR 90]. The PRAM is a convenient abstraction that combines the power of parallelism
with the simplicity of a RAM, but it has several unrealistic features. The PRAM requires: (1) global
synchronization, (2) simultaneous access across a significant bandwidth to a shared resource (memory), and
(3) that processors, memory and their interconnection be perfectly reliable. The gap between the abstract
models of parallel computation and realizable parallel computers is being bridged by current research. For
example, memory access simulation in other architectures is the subject of a large body of literature surveyed
in [Val 90a]; for some recent work see [HP 89, Ran 87, Upf 89]. Asynchronous PRAMs are examined in [CZ 89,
CZ 90, Gib 89, MSP 90, Nis 90]; this research on synchronization is related to the study of parallel reliable
computations, which is the subject of this paper.

Here, we continue and extend the study of fault tolerance that was initiated in [KS 89] and show that
arbitrary PRAM algorithms can be efficiently and deterministically executed on restartable fail-stop PRAMs
(whose processors are subject to arbitrary dynamic patterns of failures and restarts). As was shown in
[KS 89], it is possible to combine efficiency and fault-tolerance in many key PRAM algorithms in the presence
of arbitrary dynamic fail-stop processor errors (when processors fail by stopping and do not perform any
further actions).

It was determined that efficient and fault-tolerant solutions for a certain basic problem are fundamental in
making efficient parallel algorithms fault-tolerant. This problem is the Write-All problem:

Given  a  P -processor  PRAM  and 
a  0-valued  array  of  N  elements, 
write  value  1  into  all  array  locations. 

This problem was formulated to capture the essence of the computational progress that can be naturally
accomplished in unit time by a PRAM (when $P = N$). Thus, in the absence of failures, this problem is solved
by a trivial and optimal parallel assignment. However, fault-tolerant solutions that must be efficient for
worst case adaptive adversaries are non-obvious.

The iterated Write-All paradigm was employed (independently) in [KPS 90] and [Shv 89] to extend the
results of [KS 89] to arbitrary PRAM algorithms (subject to fail-stop errors without restarts). In addition to
the general simulation technique, [KPS 90] analyzed the expected behavior of several solutions to Write-All
using a particular random failure model. [Shv 89] presents a deterministic optimal work execution of PRAM
algorithms subject to worst case failures given parallel slackness (as in [Val 90b]).

A simple randomized algorithm that serves as a basis for simulating arbitrary PRAM algorithms on an
asynchronous PRAM is presented in [MSP 90]. Note that this asynchronous simulation has very good expected
performance for the problem of this paper when the adversary is off-line. Recently, [KPRS 90] further refined
the results of [KPS 90] to produce an approach that leads to constant expected slowdown of PRAM algorithms
when the power of the adversary is restricted. [KPRS 90] has also improved the fail-stop deterministic
lower and upper bounds of [KS 89] (by $\log \log N$ factors).

The general problem of assigning active processors to tasks has similarities to the problems of resource
allocation in a distributed setting. Distributed controllers have been developed for resource allocation, such
as the algorithms of [LGFG 86] (in a probabilistic setting) and [AAPS 87] (in a deterministic setting).
Fault-tolerance of particular network architectures is also studied in [DPPU 86]. However, the underlying
distributed computation models, the algorithms and their analysis are quite different from the parallel
setting studied here.

Finally, the work presented here deals with dynamic patterns of faults; for recent advances on coping with
static fault patterns see [K* 90]. We consider fault granularity at the processor level; for recent work on
gate granularities see [AU 90, Pip 85, Rud 85].


Contributions: 

We allow PRAM processors to be subject to on-line (dynamic) failures and restarts. Our failure/restart errors
are not the same as the errors of omission because processors lose their state after a failure, while errors of
omission cause a processor to skip a number of steps without losing its context.

We concentrate on the worst case analysis of the completed work of deterministic algorithms that are
subject to arbitrary adversaries, and on the overhead ratio, which amortizes the work over the necessary work
and failures/restarts.

In our model processors fail and then restart in a way that makes it possible to develop terminating
algorithms, while relaxing the requirement that one processor must never fail. We account for the work
performed by the processors in a way that discounts trivial adversaries that would otherwise force quadratic
work Write-All solutions. To guarantee algorithm termination and sensible accounting of resources, we
introduce an update cycle that generalizes the standard PRAM read/compute/write cycle. In Section 2, we first
define the model and associated complexity measures, and then discuss the reasons for the choices made. The
discussion motivates the use of update cycles, the only non-obvious technical choice made.

The trivial quadratic lower bound cited above is based on a thrashing adversary. It depends on the
adversary exploiting the separation of read and write instructions in PRAMs. When reads and writes are
accounted together in update cycles it no longer applies. Instead, we show that the Write-All problem of
size $N$ requires $\Omega(N \log N)$ work. This lower bound holds even if processors could read and locally
process all the shared memory at unit cost. Our simple lower bound is of interest because it is matched by an
$O(N \log N)$ upper bound under these assumptions. (Remark: An $\Omega(N \log N)$ lower bound was recently
shown in [KPRS 90] using a different technique and different assumptions for a fail-stop no-restart model.)
The upper bound proof arguments lead to a modification of the basic algorithm of [KS 89], so that it is
efficient and correct both in the original setting and with the failure and restart errors. We describe these
arguments in Section 3.

In Section 4 we present the main result and supporting algorithms. This is a simulation strategy for any
$N$-processor PRAM on a restartable fail-stop $P$-processor CRCW PRAM such that it guarantees a terminating
execution of each simulated $N$-processor step, with $O(\log^2 N)$ overhead ratio and (sub-quadratic)
completed work $O(\min\{N + P \log^2 N + M \log N,\ N \cdot P^{0.6}\})$, where $M$ is the number of failures
during this step's simulation. This strategy is work-optimal when the number of simulating processors is
$P \le N/\log^2 N$ and the total number of failures per each simulated $N$-processor step is $O(N/\log N)$.
The optimality result is preserved, of course, in the absence of failures. Our approach is based on:
(a) a new algorithm for Write-All whose completed work is $O(N \cdot P^{\log \frac{3}{2} + \delta})$ for
$P \le N$ and any $\delta > 0$, and which can handle any pattern of failures/restarts,
(b) a modification of an algorithm from [KS 89], and
(c) the techniques developed in [KPS 90, Shv 89].

The lower bounds apply to the worst case work of deterministic algorithms as well as to the expected work of
randomized and deterministic algorithms. Interestingly, randomization does not seem to help, given on-line,
i.e., non-prespecified, patterns of failures. For example, it is easy to construct on-line failure and restart
(respectively, no-restart) patterns that lead to exponential (respectively, quadratic) expected performance
in $N$ for the algorithms presented in [MSP 90]. These stalking adversaries are described in Section 5, where
we also conclude with some open problems.

2  Definitions 

2.1  Restartable  fail-stop  CRCW  PRAM 

We use the COMMON CRCW PRAM model, where all concurrently writing processors write the same value.
Processors are subject to stop failures and restarts as in [SS 83]. Our algorithms are described in a
model-independent fashion using high level notation with the obvious forall/parbegin/parend parallel construct.

The basis of the model is the PRAM of [FW 78]:

1. There are $P$ initial synchronous processors. Each processor has a unique permanent identifier (PID)
in the range $0, \ldots, P-1$, and each processor always knows its PID and the number of processors $P$.

2. The global memory accessible to all processors is denoted as shared; in addition, each processor
has a constant size local memory denoted as private. All memory cells are capable of storing
$O(\log \max\{N, P\})$ bits on inputs of size $N$.

3. The input is stored in $N$ cells in shared memory, and the rest of the shared memory is cleared (i.e.,
contains zeroes). The processors have access to the input and its size $N$.

In  all  our  algorithms: 

•  The PRAM processors execute sequences of instructions that are grouped in update cycles. Each
update cycle consists of reading a small fixed number of shared memory cells (e.g., $\le 4$), performing
some fixed time computation, and writing a small fixed number of shared memory cells (e.g., $\le 2$).

The  parameters  of  the  update  cycle,  i.e.,  the  number 
of  read  and  write  instructions,  are  fixed,  but  depend 
on  the  instruction  set  of  the  PRAM;  see  [FW  78]  for  a 
PRAM  instruction  set.  The  values  quoted  (4  and  2)  are 
sufficient  for  our  exposition. 
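
As an illustration of the update cycle just described, the following minimal Python sketch models a single cycle as at most 4 shared-memory reads, a fixed local computation, and at most 2 shared-memory writes. The names `run_update_cycle`, `MAX_READS`, `MAX_WRITES` and the `fail_before_write` flag are hypothetical and not part of the model; the sketch also assumes, for simplicity, that a failure strikes before any write of the cycle, whereas the model allows a failure between any two instructions.

```python
MAX_READS = 4   # read bound quoted in the text
MAX_WRITES = 2  # write bound quoted in the text


def run_update_cycle(shared, read_addrs, compute, fail_before_write=False):
    """Run one update cycle on the list `shared`.

    `compute` maps the list of values read to a list of (address, value)
    pairs to write. Returns True if the cycle completed, False if the
    processor failed before writing (an incomplete, uncharged cycle).
    """
    if len(read_addrs) > MAX_READS:
        raise ValueError("too many reads for one update cycle")
    values = [shared[a] for a in read_addrs]          # read phase
    writes = compute(values)                          # compute phase
    if len(writes) > MAX_WRITES:
        raise ValueError("too many writes for one update cycle")
    if fail_before_write:                             # simplified failure point
        return False
    for addr, value in writes:                        # write phase
        shared[addr] = value
    return True
```

Only cycles for which this function returns True would be counted toward completed work.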

We use the fail-stop with restart failure model, where time instances are the PRAM synchronous clock-ticks:

1. A failure pattern $F$ (i.e., failures and restarts) is determined by an on-line adversary that knows
everything about the algorithm and is unknown to the algorithm.

2. Any processor may fail at any time during any update cycle, or having failed it may restart at any
time, provided that:

(i) at any time during the computation at least one processor is executing an update cycle that
successfully completes, and

(ii) failures can occur before or after a write of a single bit but not during the write, i.e., bit writes
are atomic.

3. Failures do not affect the shared memory, but the failed processors lose their private memory.
Processors are restarted at their initial state with their PID as their only knowledge.

Note that failures here are different from the errors of omission, where processors preserve their local
context. The failure and restart patterns are syntactically defined as follows:

Definition 2.1 A failure pattern $F$ is a set of triples $\langle tag, PID, t \rangle$ where tag is either
failure, indicating a processor failure, or restart, indicating a processor restart, PID is the processor
identifier, and $t$ is the time indicating when the processor stops or restarts. The size of the
failure pattern $F$ is defined as the cardinality $|F|$. □
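
One way to make Definition 2.1 concrete is to store a failure pattern as a set of (tag, PID, t) records. The sketch below is only illustrative: the names `Tag`, `Event`, `pattern_size` and `is_alive` are hypothetical, and it assumes that every processor starts alive and that an event takes effect at its own time stamp.

```python
from dataclasses import dataclass
from enum import Enum


class Tag(Enum):
    FAILURE = "failure"
    RESTART = "restart"


@dataclass(frozen=True)
class Event:
    """One triple (tag, PID, t) of a failure pattern."""
    tag: Tag
    pid: int
    t: int


def pattern_size(pattern):
    """The size |F| is the number of triples in the pattern."""
    return len(pattern)


def is_alive(pattern, pid, t):
    """Replay the events of `pid` up to time t (assumed semantics)."""
    alive = True  # assumption: all processors are initially alive
    events = sorted((e for e in pattern if e.pid == pid and e.t <= t),
                    key=lambda e: e.t)
    for e in events:
        alive = e.tag is Tag.RESTART
    return alive
```

For example, a pattern with a failure of processor 2 at time 3 and a restart at time 7 has size 2, and processor 2 is down at time 5 under these assumptions.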

For simplicity of presentation, we assume that the PRAM shared memory writes of $O(\log \max\{N, P\})$ bit
words are atomic. Algorithms using this assumption can be easily converted to use only single bit atomic
writes as in [KS 89].

We investigate two natural complexity measures, completed work and overhead ratio. The completed
work measure generalizes the standard Parallel-time × Processors product and the available processor steps of
[KS 89]. The overhead ratio is an amortized measure.


Definition 2.2 Consider an algorithm with $P$ initial processors that terminates in parallel time $\tau$
after completing its task on some input data $I$ and in the presence of a failure pattern $F$. If
$P_i(I, F) \le P$ is the number of processors completing an update cycle at time $i$, and $c$ is the time
required to complete one update cycle, then we define $S(I, F, P)$ as:

$$S(I, F, P) = c \sum_{i=1}^{\tau} P_i(I, F). \qquad \Box$$

Definition 2.3 A $P$-processor PRAM algorithm on any input data $I$ of size $|I| = N$ and in the presence of
any pattern $F$ of failures and restarts of size $|F| \le M$:

(i) uses completed work:

$$S = S_{N,M,P} = \max_{I,F} \{ S(I, F, P) \},$$

(ii) has overhead ratio:

$$\sigma = \max_{I,F} \left\{ \frac{S(I, F, P)}{|I| + |F|} \right\}. \qquad \Box$$
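
To make the two measures concrete for a single execution, the following sketch computes the completed work from the number of update cycles completed at each clock tick, and an overhead ratio read as completed work divided by $|I| + |F|$ (the reading suggested by Remark 3 and by the description of the ratio as amortizing work over necessary work and failures). The function names are hypothetical, and the maximization over all inputs and failure patterns in Definition 2.3 is not modeled.

```python
def completed_work(completed_per_tick, c=1):
    """S(I, F, P) for one execution: c times the total number of
    update cycles completed over all clock ticks."""
    return c * sum(completed_per_tick)


def overhead_ratio(work, input_size, pattern_size):
    """Work amortized over input size plus failure-pattern size
    (assumed form of the ratio for a single execution)."""
    return work / (input_size + pattern_size)
```

For instance, an execution in which 3, 2 and 1 processors complete a cycle at three successive ticks has completed work 6 when $c = 1$; on an input of size 4 with a pattern of size 2 this gives a ratio of 1.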

Remark 1 Update cycles are units of accounting. They do not constrain the instruction set of the PRAM, and
failures can occur between the instructions of an update cycle. However, note that in $S(I, F, P)$ the
processors are not charged for the read and write instructions of update cycles that are not completed.

Remark 2 Consider a definition of work $S'(I, F, P)$ that also counts incomplete update cycles. Clearly
$S'(I, F, P) \le S(I, F, P) + c|F|$. Thus, using $S'$ does asymptotically affect the measure of work (when
$|F|$ is very large), but it does not asymptotically affect $\sigma$.

Remark 3 One might also generalize the overhead ratio as $\frac{S(I, F, P)}{T(|I|) + |F|}$, where $T(|I|)$ is
the time complexity of the best sequential solution known to date for the particular problem at hand. For the
purposes of this exposition, it is sufficient to express $\sigma$ in terms of the ratio
$\frac{S(I, F, P)}{|I| + |F|}$. This is because for Write-All (by itself and as used in the simulation)
$T(|I|) = \Theta(|I|)$.

2.2  Discussion  of  the  technical  choices 

Work  vs.  overhead  ratio: 

When dealing with arbitrary processor failures and restarts, the completed work measure $S$ depends on the
size $N$ of the input $I$, the number of processors $P$, and the size of the failure pattern $F$. The ultimate
performance goal for a parallel fault-tolerant algorithm is to be able to perform the required computation at
a work cost as close as possible to the work performed by the best sequential algorithm known. Unfortunately,
this goal is not attainable when an adversary succeeds in causing too many processor failures during a
computation.

Example 2.1 Consider a Write-All solution, where it takes a processor one instruction to recover from a
failure. If an adversary has a failure pattern $F$ with the number of failures and restarts
$|F| = \Omega(N^{1+\epsilon})$ for $\epsilon > 0$, then the completed work will be $\Omega(N^{1+\epsilon})$,
and thus already non-optimal and potentially large, regardless of how efficient the algorithm is otherwise.
Yet the algorithm may be extremely efficient, since it takes only one instruction to handle a failure. □

This illustrates the need for a measure of efficiency that is sensitive to both the size of the input $N$ and
the number of failures and restarts $M = |F|$. When $M = O(P)$, as in the case of the stop failures without
restarts in [KS 89], $S$ properly describes the algorithm efficiency, and $\sigma = O(S/N)$. However, when $F$
can be large relative to $N$ and $P$ (as is the case when restarts are allowed), $\sigma$ better reflects the
efficiency of a fault-tolerant algorithm.

Recall from Remark 2 that $\sigma$ is insensitive to the choice of $S$ or $S'$, and to using update cycles,
as a measure of work. However, update cycles are necessary for the following two reasons.

Update  cycles  and  termination: 

Our failure model requires that at any time, at least one processor is executing an update cycle that
completes. (This condition subsumes the condition of [KS 89] that one processor does not fail during the
computation.) This requirement is formulated in terms of update cycles and assures that some progress is made.
Without it, the algorithms may not terminate, and when they do terminate the work may not be bounded by a
function of $N$ and $P$. Since the processors lose their context after a failure, they have to read something
to regain it. Without at least one active update cycle completing, the adversary can force the PRAM to thrash
by allowing only these reads to be performed. Similar concerns are discussed in [SS 83].

Update  cycles  as  a  unit  of  accounting: 

In our definition of completed work we only count completed update cycles. Even if the progress and
termination of a computation is assured (by always completely executing at least one update cycle), but the
processors are charged for incomplete update cycles, the work $S'$ of any algorithm that simulates a single
$N$-processor PRAM step is at least $\Omega(P \cdot N)$. The reason for this quadratic behavior in $S'$ is the
following simple and rather uninteresting thrashing adversary.

Example 2.2 Let ALG be any algorithm that solves the Write-All problem under the arbitrary failure and
restart model. Consider the standard PRAM read, compute, write cycles (if processors begin writing without
reading, a simple modification of the following argument leads to the same result). A thrashing adversary
allows all processors to perform the read and compute instructions, then it fails all but one processor for
the write operation. The adversary then restarts all failed processors. Since one write operation is performed
per read, compute, write cycle, $N$ cycles will be required to initialize $N$ array elements. Each of the $P$
processors performs $\Omega(N)$ instructions, which results in work of $\Omega(P \cdot N)$. □
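
The effect of the thrashing adversary on the two ways of charging work can be seen in a small sequential simulation. The sketch below is an illustration under simplifying assumptions, not the construction of any particular algorithm: the name `thrash` is hypothetical, each started cycle is charged one unit per processor under the $S'$-style accounting, and the single surviving processor is assumed to write the first unwritten cell.

```python
def thrash(n, p):
    """Simulate the thrashing adversary on an array of n cells with
    p processors. Returns (cycles, completed, charged_all), where
    `completed` counts only completed cycles (S-style accounting) and
    `charged_all` also counts the interrupted ones (S'-style)."""
    array = [0] * n
    cycles = 0
    completed = 0
    charged_all = 0
    while 0 in array:
        # All p processors perform the read and compute instructions.
        charged_all += p
        # The adversary fails all but one processor before the write.
        target = array.index(0)
        array[target] = 1
        completed += 1
        cycles += 1
        # All failed processors are restarted for the next cycle.
    return cycles, completed, charged_all
```

With `n = 8` and `p = 4` the simulation takes 8 cycles; charging every started cycle gives 32 units, while counting only completed cycles gives 8, which is the distinction the update-cycle accounting is designed to capture.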

By charging the processors only for the completed fixed size update cycles, and not for partially completed
cycles, we do not charge for thrashing adversaries. It is interesting that this change in cost measure allows
sub-quadratic solutions.

2.3  An  architecture  for  a  restartable 
fail-stop  multiprocessor 

The main goal of this work is to study algorithmic techniques that enable efficient parallel computation on
multiprocessor systems whose processors are subject to fail-stop errors and restarts. Here we suggest one way
of realizing our abstract model of computation.

Engineering and technological approaches exist that allow implementing electronic components and systems
that operate correctly when subjected to certain failures (for examples, see [IEEE 90, Cri 91]). The
technologies we cite below are instrumental in providing the basic hardware fault-tolerance, thus providing a
foundation on which the algorithmic and software fault-tolerance can be built.

Semiconductor memories are the essential components of processors and of shared memory parallel systems.
These memories are being routinely manufactured with built-in fault tolerance using replication and coding
techniques without appreciably degrading performance (see the survey [SM 84]).

Interconnection networks are typically used in a multiprocessor system to provide communication among
processors, memory modules and other devices, e.g., as in the Ultracomputer [Sch 80]. The fault-tolerance
of interconnection networks has been the subject of much work in its own right. The networks are made
more reliable by employing redundancy (see the survey [AAS 87]). A combining interconnection network that is
perfectly suited for implementing synchronous concurrent reads and writes is formally treated in [KRS 88].

Figure 1: A robust fail-stop multiprocessor.

Finally, fail-stop processors are formally treated and justified in [SS 83].

The abstract model that we are studying can be realized (Figure 1) in the following architecture, using the
components we have just overviewed:

1. There are $P$ fail-stop processors, each with a unique address and some amount of local memory.
Processors are unreliable.

2. There are $Q$ addressable shared memory cells. The input of size $N \le Q$ is stored in shared memory.
This memory is assumed to be reliable.

3. Interconnection of processors and memory is provided by a synchronous combining interconnection
network. This network is assumed to be reliable.

With this architecture, our algorithmic techniques become completely applicable, i.e., the algorithms and
simulations we develop will work correctly, and within the complexity bounds (under the unit cost memory
access assumption), for all patterns of processor failures and restarts. This is true for as long as the
shared memory and the interconnection network are subject to the failures within their respective design
parameters.

3  Lower  bounds 

As we have shown in Example 2.2, without the update cycle accounting there is a thrashing adversary that
exhibits a quadratic lower bound for the Write-All problem. With the update cycle accounting, we prove an
$\Omega(N \log N)$ lower bound theorem.

Theorem 3.1 Given any $N$-processor CRCW PRAM algorithm that solves the Write-All problem of size $N$,
the adversary, which can cause arbitrary processor failures and restarts, can force the algorithm to perform
$\Omega(N \log N)$ completed work steps.


Proof: Let $Z$ be any algorithm for the Write-All problem subject to arbitrary failures/restarts using update
cycles. Consider each PRAM cycle. The adversary uses the following iterative strategy:

All $N$ processors are revived. For the upcoming cycle, the adversary determines the assignment of processors
to array elements. Let $U > 1$ be the number of unvisited array elements. By the pigeonhole principle, for any
processor assignment to the $U$ elements, there is a set of $\lfloor U/2 \rfloor$ unvisited elements with no
more than $\lceil N/2 \rceil$ processors assigned to them. The adversary chooses half of the remaining
previously unvisited array locations that would have had no more than $\lceil N/2 \rceil$ processors assigned
to them, and it fails these processors, allowing all others to proceed. Therefore at least
$\lfloor N/2 \rfloor$ processors will complete this step having visited no more than half of the remaining
unvisited array locations.

This strategy can be continued for at least $\log N$ iterations. The work $S$ performed by the algorithm will
be $S \ge \lfloor N/2 \rfloor \log N = \Omega(N \log N)$. □
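
The adversary strategy in the proof of Theorem 3.1 can be exercised against a concrete assignment rule. The following sketch is illustrative only: `balanced_assignment` is a hypothetical algorithm under attack (it spreads processors evenly over the unvisited cells, using 0-based PIDs), and `run_against_halving_adversary` is a hypothetical driver that, in every cycle, protects the $\lfloor U/2 \rfloor$ least-loaded unvisited cells by failing the processors assigned to them.

```python
def balanced_assignment(pid, unvisited, n):
    """Hypothetical algorithm under attack: processor `pid` (0-based)
    picks an unvisited cell so that the load is spread evenly."""
    return unvisited[pid * len(unvisited) // n]


def run_against_halving_adversary(n, assign=balanced_assignment):
    """Return (rounds, completed_work) for n processors and n cells."""
    visited = [False] * n
    work = 0
    rounds = 0
    while not all(visited):
        unvisited = [k for k in range(n) if not visited[k]]
        # All n processors are revived and choose a target cell.
        targets = [assign(pid, unvisited, n) for pid in range(n)]
        load = {k: 0 for k in unvisited}
        for k in targets:
            if k in load:
                load[k] += 1
        # The adversary protects the floor(U/2) least-loaded cells by
        # failing every processor assigned to one of them.
        by_load = sorted(unvisited, key=lambda k: load[k])
        protected = set(by_load[: len(unvisited) // 2])
        survivors = [k for k in targets if k not in protected]
        work += len(survivors)          # only completed cycles are charged
        for k in survivors:
            visited[k] = True
        rounds += 1
    return rounds, work
```

For `n = 16` the balanced rule needs at least $\log N = 4$ rounds and its completed work is at least $\lfloor N/2 \rfloor \log N = 32$, in line with the bound of the theorem.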

This lower bound is the tightest possible bound under the assumption that the processors can read and locally
process the entire shared memory at unit cost. Such an assumption is very strong. However, we take advantage
of the constructive proof strategy in the next section.

Theorem 3.2 If the fail-stop processors can read and locally process the entire shared memory at unit cost,
then a solution for the Write-All problem can be constructed such that its completed work, when using $N$
processors on the input of size $N$, is $S = O(N \log N)$.

Proof: We complement the previous lower bound with the following oblivious strategy: at each step that a
processor PID is active, it reads the $N$ elements of the array $x[1..N]$ to be visited. Say $U$ of these
elements are still not visited. The processor numbers these $U$ elements from 1 to $U$ based on their position
in the array, and assigns itself to the $i$-th unvisited element such that
$i = \lceil PID \cdot \frac{U}{N} \rceil$. This achieves load balancing with no more than
$\lceil N/U \rceil$ processors assigned to each unvisited element.

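We list the elements of the Write-All array according to the time at which the elements are visited, in
ascending order. We break this list into adjacent segments numbered sequentially starting with 1, such that
segment $j$ contains $V_j = \lfloor \frac{N}{j(j+1)} \rfloor$ elements, for $j = 1, \ldots, m$ and for some
$m \le \sqrt{N}$. When processors were assigned to the elements of the $j$-th segment, there were no less
than $U_j = N - \sum_{i=1}^{j-1} V_i \ge N - (N - \frac{N}{j}) = \frac{N}{j}$ unvisited elements. Therefore no
more than $\lceil N/U_j \rceil \le j$ processors were assigned to each element.

The work performed by such an algorithm is:

$$S \le \sum_{j=1}^{m} V_j \left\lceil \frac{N}{U_j} \right\rceil
\le \sum_{j=1}^{m} \left\lfloor \frac{N}{j(j+1)} \right\rfloor j
= O\!\left(N \sum_{j=1}^{m} \frac{1}{j+1}\right) = O(N \log N). \qquad \Box$$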
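
The self-assignment rule used in the proof of Theorem 3.2 can be checked on small instances. The sketch below is a hedged illustration: it uses 0-based PIDs and a floor-based index, `pid * U // N`, as an assumed equivalent of the rule in the proof, and the names `choose_element` and `max_load` are hypothetical. Reading the whole array in one step reflects the unit-cost assumption of the theorem.

```python
def choose_element(pid, visited):
    """Processor `pid` (0-based, fewer than len(visited)) reads the whole
    array and picks its unvisited cell by rank: index pid * U // N."""
    n = len(visited)
    unvisited = [k for k, done in enumerate(visited) if not done]
    u = len(unvisited)
    return unvisited[pid * u // n]


def max_load(visited, active_pids):
    """Largest number of active processors that chose the same cell."""
    counts = {}
    for pid in active_pids:
        k = choose_element(pid, visited)
        counts[k] = counts.get(k, 0) + 1
    return max(counts.values())
```

With $N = 10$ cells of which $U = 3$ are unvisited, no cell receives more than $\lceil N/U \rceil = 4$ of the 10 processors, whichever subset of processors happens to be active.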

4  Computation  on  restartable 
fail-stop  processors 

We first state the main result and then build the framework for proving it.

Theorem 4.1 Any $N$-processor PRAM algorithm can be executed on a fail-stop $P$-processor CRCW PRAM,
with $P \le N$. Each $N$-processor PRAM step is executed in the presence of any pattern $F$ of failures and
restarts of size $M$ with:

(i) the completed work:

$S = O(\min\{N + P \log^2 N + M \log N,\ N \cdot P^{0.6}\})$,

(ii) the overhead ratio:

$\sigma = O(\log^2 N)$.

EREW, CREW, and weak and common CRCW PRAM algorithms are simulated on fail-stop COMMON CRCW
PRAMs; arbitrary and strong CRCW PRAMs are simulated on fail-stop CRCW PRAMs of the same type. □

Remark 4 Priority CRCW PRAMs cannot be directly simulated using the same framework, for one of
the algorithms used (namely algorithm X in Section 4.2) does not possess the processor allocation
monotonicity property that assures that higher numbered processors simulate the steps of the higher numbered
original processors.

We obtain this result by: (a) modifying an algorithm from [KS 89] to enable its use with restarts,
(b) presenting a new algorithm that has a good overhead ratio efficiency and that terminates with
sub-quadratic completed work, and (c) merging the two algorithms and using the techniques of [KPS 90] or
[Shv 89] to produce efficient executions of arbitrary PRAM programs on faulty CRCW PRAMs.

We assume that $N$ is a power of 2. Non-powers of 2 can be handled using conventional padding techniques.
All logarithms are base 2. Now the details.

4.1  Algorithm V: a modification of W
of [KS 89]

Algorithm W of [KS 89] is an efficient fail-stop (no restart) Write-All solution. The algorithm uses full
binary trees as its basic data structures. The trees are implicitly coded as heaps and are stored in linear
arrays. The algorithm uses an iterative approach in which all active processors synchronously execute the
following four phases:

1.  In  the  first  phase  the  processors  are  counted  and 
enumerated  using  a  static  bottom-up,  logarithmic 
time  traversal  of  the  processor  counting  tree  data 
structure. 

2.  In  the  second  phase  the  processors  are  allocated  to 
the  unvisited  array  locations  according  to  a  divide- 
and-conquer  strategy  using  a  dynamic  top-down 
traversal  of  a  progress  tree  data  structure. 

3.  The  third  phase  is  where  the  actual  work  (array 
assignments)  is  done. 

4.  In  the  fourth  phase  the  progress  is  evaluated  by  a 
dynamic  bottom-up  traversal  of  the  progress  tree. 

This algorithm has efficient completed work when subjected to
arbitrary failure patterns without restarts. It can be
extended to handle processor restarts by introducing an
iteration counter, and having the revived processors wait for
the start of a new iteration. However, this algorithm may not
terminate if the adversary does not allow any of the
processors that were alive at the beginning of an iteration to
complete that iteration. Even if the extended algorithm were
to terminate, its completed work is not bounded by a function
of N and P.

In addition, the proof framework of [KS 89] does not easily
extend to include processor restarts, because the processor
enumeration and allocation phases become inefficient and
possibly incorrect, since no accurate estimates of active
processors can be obtained when the adversary can revive any
of the failed processors at any time.

On the other hand, the second phase of algorithm W can
implement the processor assignment based on the proof of
Theorem 3.2 in O(log N) time by using the permanent processor
PID in the top-down divide-and-conquer allocation. This also
suggests that the processor enumeration phase of algorithm W
does not improve its efficiency when processors can be
restarted.

Therefore we present a modified version of algorithm W, that
we call V.

V uses the data structures of the optimized algorithm W of
[KS 89], i.e., full binary trees with N/log N leaves, for
progress estimation and processor allocation. There are log N
array elements associated with each leaf. When using P
processors such that P ≥ N/log N on such data structures, it
is sufficient for each processor to take its PID modulo
N/log N to assure that there is a uniform initial assignment
of at least ⌊P/(N/log N)⌋ and no more than ⌈P/(N/log N)⌉
processors to a work element.
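The uniform initial assignment can be checked with a small sketch. It assumes N/log N leaves (log N array elements per leaf) and that log N divides N evenly; the function name `initial_assignment` is hypothetical and not part of the original algorithm description.

```python
from collections import Counter


def initial_assignment(num_processors: int, n: int) -> Counter:
    """Count processors per leaf when each PID is reduced modulo the leaf count.

    Assumes n is a power of 2 and that log2(n) divides n evenly.
    """
    log_n = n.bit_length() - 1           # log2(n) for a power of 2
    leaves = n // log_n                  # N / log N leaves
    return Counter(pid % leaves for pid in range(num_processors))


n, p = 16, 10                            # 16 / log2(16) = 4 leaves
load = initial_assignment(p, n)
low, high = p // 4, -(-p // 4)           # floor and ceiling of P / (N / log N)
assert all(low <= load[leaf] <= high for leaf in range(4))
print(sorted(load.items()))              # [(0, 3), (1, 3), (2, 2), (3, 2)]
```

With P = 10 and four leaves, every leaf receives either 2 or 3 processors, which matches the floor and ceiling bounds stated above.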


Algorithm V is an iterative algorithm through the following
three phases (we “prime” the phases to distinguish them from
the phases of algorithm W):

1'  Allocate  processors  using  PIDs  in  a  dynamic  top- 
down  traversal  of  the  progress  tree  to  assure  load 
balancing  (O(logN)  time). 

2'  The processors now perform work at the leaves they
reached in Phase 1' (there are log N array elements
per leaf).

3'  The  processors  begin  at  the  leaves  of  the  progress 
tree  where  they  ended  Phase  2'  and  update  the 
progress  tree  dynamically,  bottom  up  (O(logN) 
time). 

The following implementation detail is important in realizing
processor re-synchronization after a failure and a restart. An
iteration wrap-around counter is utilized, so that if a
processor fails, and then is restarted, it waits for the
counter wrap-around to rejoin the computation. The point at
which the counter wraps around depends on the length of the
program code, but it is fixed at “compile time”. If after a
restart, a processor detects that the counter did not change
for one cycle, it asserts that no processors were active at
the point of the restart, and it can start a new iteration by
itself. This is possible since the processors are synchronous.

Analysis  of  algorithm  V: 

We  now  analyze  the  performance  of  this  algorithm  first 
in  the  fail-stop,  and  then  in  the  fail-stop  and  restart 
setting. 

Lemma 4.2 The completed work of V using P ≤ N processors that
are subject to fail-stop errors without restarts is
S = O(N + P log^2 N).

Proof: We distinguish two cases below. In each of the cases,
it takes O(log(N/log N)) = O(log N) time to perform processor
allocation, and O(log N) time to perform the work at the
leaves. Thus each iteration of the algorithm takes O(log N)
time. We use Theorem 3.2, where instead of reading and locally
processing the entire memory at unit cost, we use an O(log N)
time iteration for processor allocation.

Case 1: 1 ≤ P < N/log N. In this case, at most 1 processor is
initially allocated to each leaf. Similarly to Theorem 3.2,
when the first N/log N − P leaves are visited, there is no
more than 1 processor allocated to each leaf, by the balanced
allocation phase. When the remaining P or fewer leaves are
visited, the work is O(P log P) by Theorem 3.2 (not counting
processor allocation). Each leaf visit takes O(log N) work
steps, therefore the completed work is
S = O((N/log N − P + P log P) log N) = O(N + P log P log N)
= O(N + P log^2 N).


01  forall processors PID=0..P−1 parbegin
02    Perform initial processor assignment to the leaves of the progress tree
03    while there is still work left in the tree do
04      if current subtree is done then move one level up
05      elseif this is a leaf then perform the work at the leaf
06      elseif this is an interior tree node then
07        if both subtrees are done then update the tree node
08        elseif only one is done then go to the one that is not done
09        else move to the left/right subtree according to PID bit values
10        fi
11      fi
12    od
13  parend


Figure 2: A high level view of the algorithm X.


Case 2: N/log N ≤ P ≤ N. In this case, no more than
⌈P/(N/log N)⌉ processors are initially allocated to each leaf.
Any two processors that are initially allocated to the same
leaf, should they both survive, will behave identically
throughout the computation. Therefore we can use Theorem 3.2
with the ⌈P/(N/log N)⌉ processor allocation as a
multiplicative factor. From this the completed work is
S = ⌈P/(N/log N)⌉ · O(N log N) = O(P log^2 N).

The results of the two cases are combined to yield
S = O(N + P log^2 N). □

The  following  theorem  expresses  the  completed  work 
of  the  algorithm: 

Theorem 4.3 The completed work of V using P ≤ N processors
subject to an arbitrary failure and restart pattern F of size
M is: S = O(N + P log^2 N + M log N).

Proof: The proof of Lemma 4.2 does not rely on the fact that
in the absence of restarts, the number of active processors is
non-increasing. However, the lemma does not account for the
work that might be spent by the processors that are active
during a part of an iteration without contributing to the
progress of the algorithm due to failures. To account for all
work, we are going to charge to the array being processed the
work that contributes to progress, and any work that was
“wasted” due to failures will be charged to the failures and
restarts. Lemma 4.2 accounts for the work charged to the
array. Otherwise, we observe that a processor can “waste” no
more than O(log N) time steps without contributing to the
progress due to a failure and/or a restart. Therefore this
amount of “wasted” work is bounded by O(M log N). This proves
the theorem. (Note that the completed work S of V is small for
small |F|, but it is not bounded by a function of P and N for
a large |F|). □


4.2  Algorithm  X  and  its  analysis 

We present a new algorithm X for the Write-All problem. We
show that its completed work complexity is S = O(N · P^{0.6})
for any failure/restart pattern using P ≤ N processors. The
important property of X is that it has a bounded sub-quadratic
completed work regardless of the failure pattern, and if a
very large number of failures occurs, say
|F| = Ω(N · P^{0.6}), then the algorithm's overhead ratio σ
becomes optimal: it takes a fixed number of computing steps
per failure/recovery.

The algorithm utilizes a progress tree of size N as algorithm
V does, but it is traversed by the processors independently,
and not in synchronized phases. This reflects the local nature
of the processor assignment in algorithm X as opposed to the
global assignments used in algorithms V and W. Each processor,
acting independently, searches for work in the smallest
immediate subtree that has work that needs to be done; it then
performs the necessary work, and moves out of that subtree
when no more work remains. Details follow.

Input: Shared array x[1..N]; x[i] = 0 for 1 ≤ i ≤ N.

Output: Shared array x[1..N]; x[i] = 1 for 1 ≤ i ≤ N.

Data-structures: The algorithm uses a full binary tree of size
2N − 1, stored as a heap d[1..2N−1] in shared memory. An
internal tree node d[i] (i = 1,...,N − 1) has the left child
d[2i] and the right child d[2i+1]. The tree is used for
progress evaluation and processor allocation. The values
stored in the heap are initially 0.

The N elements of the input array x[1..N] are associated with
the leaves of the tree. Element x[i] is associated with
d[i + N − 1], where 1 ≤ i ≤ N. The algorithm also utilizes an
array w[0..P−1] that is used to store individual processor
locations within the progress tree d.
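The heap layout described above can be sketched as follows. This is only an illustration of the index arithmetic; the helper names (`make_progress_tree`, `leaf_index`, and so on) are hypothetical, and index 0 of the Python list is left unused so that the 1-based indices of the text carry over directly.

```python
def make_progress_tree(n: int) -> list[int]:
    """Heap d[1..2N-1], initially all 0; slot 0 is unused padding."""
    return [0] * (2 * n)


def left_child(i: int) -> int:
    return 2 * i


def right_child(i: int) -> int:
    return 2 * i + 1


def parent(i: int) -> int:
    return i // 2                        # integer division; parent(1) == 0 means "outside the tree"


def leaf_index(i: int, n: int) -> int:
    """Heap position of the leaf paired with array element x[i], 1 <= i <= n."""
    return i + n - 1


def is_leaf(node: int, n: int) -> bool:
    return node >= n                     # interior nodes are 1..N-1, leaves are N..2N-1


n = 8
d = make_progress_tree(n)
print(len(d) - 1, leaf_index(1, n), leaf_index(n, n))   # 15 8 15
```

For N = 8 the tree has 15 nodes, and the elements x[1..8] sit under heap positions 8 through 15.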

Each processor uses some constant amount of private memory to
perform simple arithmetic computations. An important private
constant is PID, containing the initial processor identifier.

Thus, the overall memory used is O(N + P) and the
data-structures are simple.

Control-flow: The algorithm consists of a single
initialization and of the parallel loop. The high level view
of the algorithm is in Figure 2 (all line numbers refer to the
figure); a more detailed code is in the appendix.

This algorithm is performed by all processors that are active.
The initialization (line 02) assigns the P processors to the
leaves of the progress tree so that the processors are
assigned to the first P leaves by storing the initial leaf
assignment in w[PID]. The loop (lines 03-12) consists of a
multi-way decision (lines 04-11) to: (line 04) move up the
tree if the current node is marked done, (line 05) perform the
work if at a leaf, (line 07) update the interior tree node if
both of its subtrees are done by changing its value from 0 to
1, (line 08) move down to the left or right subtree, whichever
is not done, when exactly one of the subtrees is done.

For the final case (line 09), the processors move down when
neither child is done based on the processor identifier. This
last case is where the non-trivial (italicized) decision is
made. The PID of the processor is used at depth h of the tree
node based on the value of the h-th most significant bit of
the binary representation of the PID: bit 0 will send the
processor to the left, and bit 1 to the right.
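The PID-based descent can be illustrated with a short sketch. It assumes that N is a power of 2, that PIDs are read as log N-bit numbers with the most significant bit numbered 0, and that the depth of heap node `where` is ⌊log where⌋ (as in the appendix notation); the function names are hypothetical.

```python
def pid_bit(pid: int, depth: int, n: int) -> int:
    """Bit number `depth` of the log2(n)-bit PID, counting the most significant bit as 0."""
    width = n.bit_length() - 1           # log2(n) for n a power of 2
    return (pid >> (width - 1 - depth)) & 1


def descend(where: int, pid: int, n: int) -> int:
    """Child chosen at interior node `where` when neither subtree is done."""
    depth = where.bit_length() - 1       # floor(log2(where)): depth of the heap node
    return 2 * where + pid_bit(pid, depth, n)   # bit 0 goes left, bit 1 goes right


def leaf_reached(pid: int, n: int) -> int:
    """Follow the PID bits from the root of an untouched tree down to a leaf."""
    node = 1
    while node < n:
        node = descend(node, pid, n)
    return node


print([leaf_reached(pid, 8) for pid in range(8)])   # [8, 9, 10, 11, 12, 13, 14, 15]
```

On a tree where no subtree is done yet, the rule sends each processor toward a distinct leaf, which spreads the processors evenly over the unfinished work.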

Remark 5 It is possible to perform local optimization of the
algorithm by: (i) evenly spacing the P processors N/P leaves
apart when P < N, and by (ii) using the integer values at the
progress tree nodes to represent the known number of
descendant leaves visited by the algorithm. Our worst case
analysis does not benefit from these modifications.

Example 4.1 Consider algorithm X for N = P = 8. The progress
tree d of size 2N − 1 = 15 is used to represent the full
binary progress tree with eight leaves. The 8 processors have
PIDs in the range 0 through 7. Their initial positions are
indicated in Figure 3 under the leaves of the tree.

The diagram in Figure 3 illustrates the state of a computation
where the processors were subject to some failures and
restarts. Heavy dots indicate nodes whose subtrees are
finished. The paths being traversed by the processors are
indicated by the arrows. Active processor locations (at the
time when the snapshot was taken) are indicated by their PIDs
in brackets. In this configuration, should the active
processors complete the next cycle, they will move in the
directions indicated by the arrows: processors 0 and 1 will
descend to the left and right respectively, processor 4 will
move to the unvisited leaf to its right, and processors 6 and
7 will move up. □


Figure 3: Processor traversal of the progress tree.

Regardless of the decision made by a processor within the loop
body, each iteration of the body consists of no more than four
shared memory reads, a fixed time computation using private
memory, and one shared memory write (see the appendix for the
detailed algorithm). Therefore the body can be implemented as
an update cycle.

Analysis of algorithm X:

We begin by showing correctness and termination of algorithm X
in the following simple lemma.

Lemma 4.4 Algorithm X with N processors is a correct Ω(log N)
and O(N) time fault-tolerant solution for the Write-All
problem of size N. □

Now a lemma relating completed work when overlapping of
processors occurs, and the main work lemma. In the rest of
this section, the expression “S_{N,P}” denotes the completed
work on inputs of size N using P initial processors and for
any failure pattern.

Lemma 4.5 For algorithm X, if N is the size of the input, and
N ≤ P1 ≤ P2, then the work using P1 processors and the work
using P2 processors relate as
S_{N,P2} ≤ ⌈P2/P1⌉ · S_{N,P1}.

Proof sketch: This follows from Definition 2.2 of S and the
observation that if P ≥ N, then exactly log N bits of the PIDs
are significant during the execution of algorithm X. We
observe that any two processors whose PIDs are equal modulo N
will expend no more than a single processor in the worst case
at twice the cost. □

Lemma 4.6 The work complexity S of algorithm X with N initial
processors for the Write-All problem of size N and for any
pattern of failures and restarts is S = O(N^{log 3 + δ}) for
any δ > 0. □

Proof: We will show that for any positive δ there is a
constant c such that S ≤ c · N^{log 3 + δ}. We proceed by
induction on the height of the progress tree. For the base
case: we have a tree of height 0 that corresponds to an input
array of size 1, and exactly 1 processor. Since at least this
processor will be active, this single leaf will be visited in
a constant number of steps. Let the work expended be c' for
some constant c' that depends only on the lexical structure of
the algorithm. Therefore S_{1,1} = c' ≤ c · 1^{log 3 + δ} for
all c ≥ c', and any δ > 0.

For the inductive hypothesis: we assume that for the tree
heights less than log N, and for any δ > 0, the required
constant c exists. We then prove that this is true for the
tree of height log N.

Consider the two subtrees of the root (Figure 4). The two
corresponding subtrees are of height log N − 1. By the
definition of algorithm X, no processor will leave a subtree
until the subtree is finished. We have to consider the
following two sub-cases: (1) both subtrees are finished
simultaneously, and (2) one of the subtrees is finished before
the other.

Case 1: If both subtrees are finished simultaneously, then the
algorithm will terminate after some small constant number of
steps c' when a processor moves to the root and determines
that both of the subtrees are finished. By the inductive
hypothesis, there exists a c such that both the work S_L
expended in the left subtree, and the work S_R in the right
subtree are bounded by c(N/2)^{log 3 + δ}. The work needed for
the algorithm to terminate is at most c'N, and so the total
work is:

S ≤ S_L + S_R + c'N ≤ 2 S_{N/2,N/2} + c'N
  ≤ 2c(N/2)^{log 3 + δ} + c'N
  = c · (2/(3 · 2^δ)) · N^{log 3 + δ} + c'N.

When c is chosen sufficiently larger than c', e.g., c ≥ 3c',
then S ≤ c · N^{log 3 + δ}.

Case 2: Assume w.l.o.g. that the left subtree is finished
first with S_L = S_{N/2,N/2} ≤ c(N/2)^{log 3 + δ} by the
inductive hypothesis. The processors from the left subtree
will start moving via the root to the right subtree. The path
traversed by any processor as it moves to the right subtree
after the left subtree is finished is bounded by c' log N for
a predefined constant c' (the longest path from a leaf to
another leaf). No more than the original N/2 processors of the
left subtree will move, and so the work of moving the
processors is bounded by c'(N/2) log N.

By Lemma 4.5 and by the inductive hypothesis, the work S_R to
complete the right subtree using N processors is bounded by
S_{N/2,N} ≤ 2 S_{N/2,N/2} ≤ 2c(N/2)^{log 3 + δ}. After this,
each processor will spend some constant number of steps moving
to the root and terminating the algorithm. This work is
bounded by c''N for some small constant c''. The total work S
is:

S ≤ S_L + c'(N/2) log N + S_R + c''N
  ≤ c(N/2)^{log 3 + δ} + c'(N/2) log N + 2c(N/2)^{log 3 + δ} + c''N
  ≤ (c/2^δ) · N^{log 3 + δ} + c'(N/2) log N + c''N.

When c is made sufficiently large based on δ with respect to
the fixed c' and c'', e.g., c ≥ [...], then:

S ≤ c · N^{log 3 + δ}.


Figure 4: Inductive step for Lemma 4.6.

Since a constant c depends only on the lexical structure of
the algorithm and δ, it can always be chosen sufficiently
large to satisfy the base case and both the cases (1) and (2)
of the inductive step. This completes the proof of the second
case and of the lemma. □

Now we generalize this result for P ≤ N:

Theorem 4.7 There is an algorithm that solves the Write-All
problem with completed work S = O(N · P^{log(3/2) + δ}) for
any δ > 0, where N is the input array size, and P ≤ N is the
initial number of processors.

Proof sketch: We position the P processors at the first P
elements of the input array. It is easy to show that
S = O((N/P) · S_{P,P}) = O((N/P) · P^{log 3 + δ})
= O(N · P^{log(3/2) + δ}). □

For example, when δ is about 0.01, S = O(N · P^{0.6}). We next
show a particular performance of algorithm X such that its
completed work is asymptotically close to its upper bound.
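The arithmetic behind the 0.6 exponent can be checked numerically. The sketch below assumes the exponent of P is log(3/2) + δ with base-2 logarithms, i.e. the N^{log 3 + δ} bound of Lemma 4.6 with one factor of P divided out; the variable names are illustrative only.

```python
from math import log2

delta = 0.01                             # the example value of δ used in the text
exponent = log2(3 / 2) + delta           # exponent of P in N · P^(log(3/2) + δ)

# Dividing P^(log 3 + δ) by P lowers the exponent by exactly 1.
assert abs((log2(3) - 1) - log2(3 / 2)) < 1e-12

print(round(exponent, 3))                # 0.595
```

Since 0.595 is below 0.6, the bound O(N · P^{0.6}) follows for this choice of δ.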

Theorem 4.8 There exists a pattern of fail-stop/restart errors
that cause the algorithm X to perform S = Ω(N^{log 3}) work on
the input of size N using P = N processors.

Proof sketch: We compute the exact work performed by the
algorithm when the adversary adheres to the following
strategy: the processor with PID 0 will be allowed to
sequentially traverse the progress tree in post-order starting
at the leftmost leaf and finishing at the rightmost leaf. The
processors that find themselves at the same leaf as the
processor 0 are (re)started, while the rest are failed. All
processors with PIDs smaller than the index of the last leaf
visited by processor 0 are allowed to traverse the progress
tree until they reach a leaf. When processors reach a leaf,
the failure/restart procedure is repeated. □


4.3  Combining  the  building  blocks 

An approach for executing arbitrary PRAM programs on fail-stop
CRCW PRAMs (without restart) was presented independently in
[KPS 90] and [Shv 89]. The execution is based on simulating
individual PRAM computation steps using the Write-All
paradigm, and it was shown that the complexity of solving an
N-size instance of the Write-All problem using P fail-stop
processors, and the complexity of executing a single
N-processor PRAM step on a fail-stop P-processor PRAM are
equal. Here we describe how algorithms V and X are combined
with the framework of [KPS 90] or [Shv 89] to yield efficient
executions of PRAM programs on PRAMs that are subject to
stop-failures and restarts as stated in Theorem 4.1.

We first observe that the executions of algorithms V and X can
be interleaved to yield an algorithm that achieves the
following performance:

Theorem 4.9 There exists a Write-All solution using P ≤ N
processors on instances of size N such that for any pattern F
of failures and restarts with |F| ≤ M, the completed work is
S = O(min{N + P log^2 N + M log N, N · P^{0.6}}), and the
overhead ratio is σ = O(log^2 N).

The simulations of the individual PRAM steps are based on
replacing the trivial array assignments in a Write-All
solution with the appropriate components of the PRAM steps.
These steps are decomposed into a fixed number of assignments
corresponding to the standard fetch/decode/execute RAM
instruction cycles in which the data words are moved between
the shared memory and the internal processor registers. The
resulting algorithm is then used to interpret the individual
cycles using the available fail-stop processors and to ensure
that the results of computations are stored in temporary
memory before simulating the synchronous updates of the shared
memory with the new values. For the details on this technique,
the reader is referred to [KS 89, KPS 90, Shv 89]. Application
of these techniques in conjunction with the algorithms V and X
yields efficient and terminating executions of any
non-fault-tolerant PRAM programs in the presence of arbitrary
failure and restart patterns.


Theorem  4.1  follows  from  Theorem  4.9  and  the  results 
of  [KPS  90]  or  [Shv  89]. 

The  following  corollaries  are  also  interesting: 

Corollary 4.10 Under the hypothesis of Theorem 4.1, and if
|F| ≤ P ≤ N, then S = O(N + P log^2 N), and σ = O(log^2 N).

The fail-stop (without restarts) behavior is subsumed by
Corollary 4.10. Without restarts, [KPRS 90] have an algorithm
with S = O(N + P log^2 N / log log N); [Mar 91] has shown that
the same performance is achieved by algorithm W from [KS 89].
The exact analysis of algorithm V without restarts is still
open.

Corollary 4.11 Under the hypothesis of Theorem 4.1:

1.  when |F| is Ω(N log N), then σ is O(log N),

2.  when |F| is Ω(N^{1.6}), then σ is O(1).

Thus  the  efficiency  of  our  algorithm  improves  for  large 
failure  patterns. 

These  results  also  suggest  that  it  is  harder  to  deal 
efficiently  with  a  few  worst  case  failures  than  with  a 
large  number  of  failures. 

Another interesting result is that there is a range of
parameters for which the completed work is optimal, i.e., the
work performed in executing a parallel algorithm on a faulty
PRAM is asymptotically equal to the Parallel-time × Processors
product for that algorithm:

Corollary 4.12 Any N-processor, τ-time PRAM algorithm can be
executed on a P ≤ N/log^2 N processor fail-stop CRCW PRAM,
such that when during the execution of each N-processor step
of that algorithm the total number of processor failures and
restarts is O(N/log N), then the completed work is
S = O(τ · N).

It also follows that optimality is preserved in the absence of
failures or when during the execution of each N-processor step
there are O(log N) failures and restarts per each simulating
processor. This is because in either of these two cases, the
size of the failure/restart pattern F is bounded by:
|F| ≤ O(P log N) = O((N/log^2 N) · log N) = O(N/log N).

5  Discussion and Open Problems

We conclude with a brief discussion of open problems and the
effects of on-line adversaries on the expected performance of
randomized algorithms. First the open problems and future
work:

•  Lower bounds with and without restarts: We have shown an
Ω(N log N) lower bound for failures/restarts under the
assumption that processors can read and locally process the
entire shared memory at unit cost. Under this assumption this
is the best possible lower bound.

Under the same assumption, it can be shown that the lower
bound of [KS 89] of Ω(N log N / log log N) is the best
possible bound for failures without restarts.

Under different assumptions, an Ω(N log N) lower bound is
shown for failures without restarts in [KPRS 90]. Can these
bounds be further improved using different assumptions?

•  Upper bounds with restarts: Progress in this area ought to
be made by finding new algorithms, or improving the analysis
of existing algorithms to achieve better completed work S and
overhead ratio σ than those of algorithms V and X.

•  Upper bounds without restarts: What is the worst case
completed work S, and overhead ratio σ of the algorithm X in
the case of fail-stop errors without restarts?

Algorithm X appears to have a very good performance in the
fail-stop (without restart) framework of [KS 89]. For example,
the adversary used to show the lower bound in [KS 89] causes
the worst case work of S = Ω(N log^2 N / log log N) for the
N-processor Write-All solution in [KS 89]. The same adversary
causes the known worst case work of X of
S = Ω(N log N log log N / log log log N).

We conjecture that the fail-stop (no restart) performance of X
has work S = Θ(N log N log log N) using N processors.

•  For the update cycles used in this work, what is the
minimum number of reads and writes that are sufficient to
assure efficient solutions, and under what assumptions?

On randomization and lower bounds:

The existing upper bounds for randomized solutions for
Write-All apply to off-line, i.e., non-adaptive adversaries.
For example, the lower bounds of Section 3 apply to both the
worst case performance of deterministic algorithms and the
expected performance of randomized algorithms (subject to
adaptive adversaries).

A randomized asynchronous coupon clipping (ACC) algorithm for
the Write-All problem was analyzed in [MSP 90]. Assuming
off-line adversaries, it was shown in [MSP 90] that their ACC
algorithm performs expected O(N) work using P = [...]
processors on inputs of size N.

In contrast, we observe that a simple stalking adversary
causes the ACC algorithm to perform (expected) work of
Ω(N^2/polylog N) in the case of fail-stop errors, and
Ω([...]) work in the case of fail-stop errors with restart
even when using P ≤ [...] processors. The stalking adversary
strategy consists of choosing a single leaf in a binary tree
employed by ACC, and failing all processors that touch that
leaf until only one processor remains in the fail-stop case,
or until all processors simultaneously touch the leaf in the
fail-stop/restart case. This performance is not improved even
when using the completed work accounting. On a positive note,
when the adversary is made off-line, the ACC algorithm becomes
efficient in the fail-stop/restart setting.
forall processors PID=0..P−1 parbegin
  shared x[1..N];                  -- shared memory
  shared d[1..2N−1];               -- "done" heap (progress tree)
  shared w[0..P−1];                -- "where" array
  private done, where;             -- current node done/where
  private left, right;             -- left/right child values

  action,recovery
    w[PID] := N + PID;             -- the initial positions (first P leaves)
  end;

  action,recovery
    while w[PID] ≠ 0 do            -- while haven't exited the tree
      where := w[PID];             -- current heap location
      done := d[where];            -- doneness of this subtree
      if done then w[PID] := where div 2;               -- move up one level
      elseif not done ∧ where > N−1 then                -- at a leaf
        if x[where−N+1] = 0 then x[where−N+1] := 1;     -- initialize leaf
        elseif x[where−N+1] = 1 then d[where] := 1;     -- indicate "done"
        fi
      elseif not done ∧ where ≤ N−1 then                -- interior tree node
        left := d[2*where]; right := d[2*where+1];      -- read left/right child values
        if left ∧ right then d[where] := 1;             -- both children done
        elseif not left ∧ right then w[PID] := 2*where;     -- go left
        elseif left ∧ not right then w[PID] := 2*where+1;   -- go right
        elseif not left ∧ not right then                -- both subtrees are not done
          -- move down according to the PID bit
          if not PID[log(where)] then w[PID] := 2*where;      -- move left
          elseif PID[log(where)] then w[PID] := 2*where+1;    -- move right
          fi
        fi
      fi
    od
  end
parend.


Figure 5: Algorithm X.


Appendix: Algorithm X pseudocode

Here we give a detailed pseudocode for algorithm X.

In the algorithm X pseudocode, the action, recovery, end
construct of [SS 83] is used to denote the actions and the
recovery procedures for the processors. In the algorithm this
signifies that an action is also its own recovery action,
should a processor fail at any point within the action block.

The notation “PID[log(k)]” is used to denote the binary
true/false value of the ⌊log(k)⌋-th bit of the log(N)-bit long
binary representation of PID, where the most significant bit
is the bit number 0, and the least significant bit is bit
number log N − 1. Finally, div stands for integer division
with truncation.
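The loop body of algorithm X can be exercised with a small failure-free sketch. It makes several assumptions that are not in the original: processors start at heap index N + PID (the first P leaves), element x[i] is paired with d[i + N − 1], update cycles are interleaved one processor at a time rather than run on a synchronous PRAM, and no failures or restarts occur. The function name `algorithm_x` is hypothetical.

```python
def algorithm_x(n: int, p: int) -> tuple[list[int], int]:
    """Failure-free run of the algorithm X loop; returns (x[1..N], update cycles used)."""
    assert n >= 1 and n & (n - 1) == 0 and 1 <= p <= n   # N a power of 2, P <= N
    width = n.bit_length() - 1           # log2(N): number of significant PID bits
    x = [0] * (n + 1)                    # x[1..N]; slot 0 unused
    d = [0] * (2 * n)                    # progress heap d[1..2N-1]; slot 0 unused
    w = [n + pid for pid in range(p)]    # assumed start: the first P leaves
    cycles = 0
    while any(w):                        # w[pid] == 0 means the processor left the tree
        for pid in range(p):             # one update cycle per processor, in turn
            where = w[pid]
            if where == 0:
                continue
            cycles += 1
            if d[where]:                 # subtree done: move up one level
                w[pid] = where // 2
            elif where >= n:             # at a leaf
                if x[where - n + 1] == 0:
                    x[where - n + 1] = 1 # perform the work
                else:
                    d[where] = 1         # mark the leaf done
            else:                        # interior node: read both children
                left, right = d[2 * where], d[2 * where + 1]
                if left and right:
                    d[where] = 1
                elif right:              # only the left subtree is unfinished
                    w[pid] = 2 * where
                elif left:               # only the right subtree is unfinished
                    w[pid] = 2 * where + 1
                else:                    # neither done: follow the PID bit for this depth
                    depth = where.bit_length() - 1
                    w[pid] = 2 * where + ((pid >> (width - 1 - depth)) & 1)
    return x[1:], cycles


result, cycles = algorithm_x(8, 3)
print(result)                            # [1, 1, 1, 1, 1, 1, 1, 1]
```

Each pass through the inner branch performs at most one shared write, mirroring the update-cycle structure of the pseudocode; an adversary would be modeled by skipping or resetting processors between cycles, which this sketch does not attempt.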


Remark 6 The action/recovery construct can be implemented by
appropriately checkpointing the instruction counter in stable
storage as the last instruction of an action, and reading the
instruction counter upon a restart. We are not providing
further details here.

Remark 7 The algorithm can be used to solve Write-All “in
place” using the array x[] as a tree of height log(N/2) with
the leaves x[N/2..N−1] and doubling up the processors at the
leaves, and using x[N] as the final element to be initialized
and used as the algorithm termination sentinel. With this
modification, array d[] is not needed. The asymptotic
efficiency of the algorithm is not affected.